iT邦幫忙

12th iThome Ironman Contest

AI & Data

Machine Learning series, Part 4

Day-4 Feature Engineering -- 2. Categorical Encoding(3)

2.1 One hot encoding
2.2 Count and Frequency encoding
2.3 Target encoding / Mean encoding
2.4 Ordinal encoding
2.5 Weight of Evidence
2.6 Rare label encoding
2.7 Helmert encoding
2.8 Probability Ratio Encoding
2.9 Label encoding
2.10 Feature hashing
2.11 Binary encoding & BaseN encoding

We will use the following data frame, which has two independent variables (features) and one label (target), with ten records in total.

import pandas as pd
import numpy as np
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target':[1,1,1,0,1,0,1,0,1,1]}

df = pd.DataFrame(data, columns = ['Temperature', 'Color', 'Target'])
Rec-No Temperature Color Target
0 Hot Red 1
1 Cold Yellow 1
2 Very Hot Blue 1
3 Warm Blue 0
4 Hot Red 1
5 Warm Yellow 0
6 Warm Red 1
7 Hot Yellow 0
8 Hot Yellow 1
9 Cold Yellow 1

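Supplement to 2.5 Weight of Evidence: a code example
First, for each category of Temperature, compute the share of rows with Target = 1 and with Target = 0. For example, the Hot category has three rows with Target = 1 and one row with Target = 0, so within Hot the share of target = 1 is 0.75 and the share of target = 0 is 0.25.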
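
As a worked instance of the quantity the code below computes, here is the Hot category written out. The symbol $p_c$ (the share of Target = 1 within category $c$) is notation introduced here for illustration and does not appear in the original.

```latex
\mathrm{WoE}_c = \ln\frac{p_c}{1 - p_c},
\qquad
\mathrm{WoE}_{\text{Hot}} = \ln\frac{0.75}{0.25} = \ln 3 \approx 1.0986
```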

# target = 1 i.e. Good = 1
woe_df = df.groupby('Temperature')['Target'].mean()
woe_df = pd.DataFrame(woe_df)
# rename the column 'Target' to 'Good'
woe_df = woe_df.rename(columns={'Target':'Good'})
# Calculate Bad probability :  1 - Good probability
woe_df['Bad'] = 1-woe_df.Good
# replace a zero denominator with a small value to avoid division by zero
woe_df['Bad'] = np.where(woe_df['Bad']==0, 0.000001, woe_df['Bad'])
# Compute the WoE
woe_df['WoE'] = np.log(woe_df.Good/woe_df.Bad)
woe_df
/ Good Bad WoE
Temperature
Cold 1.000000 0.000001 13.815511
Hot 0.750000 0.250000 1.098612
Very Hot 1.000000 0.000001 13.815511
Warm 0.333333 0.666667 -0.693147

With the WoE values computed, we add them to the original data.

# Map the WoE value back to each row of the data frame
df.loc[:, 'WoE_Encode'] = df['Temperature'].map(woe_df['WoE'])
df
/ Temperature Color Target WoE_Encode
0 Hot Red 1 1.098612
1 Cold Yellow 1 13.815511
2 Very Hot Blue 1 13.815511
3 Warm Blue 0 -0.693147
4 Hot Red 1 1.098612
5 Warm Yellow 0 -0.693147
6 Warm Red 1 -0.693147
7 Hot Yellow 0 1.098612
8 Hot Yellow 1 1.098612
9 Cold Yellow 1 13.815511
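
The code above guards only the case Bad = 0. A category in which every row has Target = 0 would give Good = 0 and `np.log(0)`, which is negative infinity. A minimal sketch of one way to guard both ends follows, assuming a hypothetical helper name `woe_encode` and a hypothetical `eps` parameter, neither of which comes from the original. Because it clips Good into [eps, 1 - eps] instead of replacing Bad alone, the value for all-good categories comes out marginally different from 13.815511.

```python
import numpy as np
import pandas as pd


def woe_encode(frame, column, target, eps=1e-6):
    """Hypothetical helper: map each category of `column` to ln(Good / Bad)."""
    # Share of target == 1 within each category
    good = frame.groupby(column)[target].mean()
    # Clip into [eps, 1 - eps] so that neither log(0) nor division by zero occurs
    good = good.clip(eps, 1 - eps)
    bad = 1 - good
    return np.log(good / bad)


data = {'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                        'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]}
df = pd.DataFrame(data)

woe = woe_encode(df, 'Temperature', 'Target')
df['WoE_Encode'] = df['Temperature'].map(woe)

# An all-bad category (made up for this check) stays finite as well
all_bad = pd.DataFrame({'Temperature': ['Icy', 'Icy'], 'Target': [0, 0]})
woe_all_bad = woe_encode(all_bad, 'Temperature', 'Target')
print(woe)
print(woe_all_bad)
```

Clipping is only one possible design choice. Smoothing the counts or merging rare categories first are alternatives, and which fits best depends on the data.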

2.8 Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence (WoE). The only difference is that it uses the ratio Good/Bad itself rather than the natural log of that ratio.

# target = 1 i.e. Good = 1
pr_df = df.groupby('Temperature')['Target'].mean()
pr_df = pd.DataFrame(pr_df)
# rename the column 'Target' to 'Good'
pr_df = pr_df.rename(columns={'Target':'Good'})
# Calculate Bad probability :  1 - Good probability
pr_df['Bad'] = 1-pr_df.Good
# replace a zero denominator with a small value to avoid division by zero
pr_df['Bad'] = np.where(pr_df['Bad']==0, 0.000001, pr_df['Bad'])
# Compute the Probability Ratio
pr_df['PR'] = pr_df.Good/pr_df.Bad
pr_df
/ Good Bad PR
Temperature
Cold 1.000000 0.000001 1000000.0
Hot 0.750000 0.250000 3.0
Very Hot 1.000000 0.000001 1000000.0
Warm 0.333333 0.666667 0.5

With the Probability Ratio values computed, we add them to the original data.

# Map the Probability Ratio value back to each row of the data frame
df.loc[:, 'PR_Encode'] = df['Temperature'].map(pr_df['PR'])
df
/ Temperature Color Target PR_Encode
0 Hot Red 1 3.0
1 Cold Yellow 1 1000000.0
2 Very Hot Blue 1 1000000.0
3 Warm Blue 0 0.5
4 Hot Red 1 3.0
5 Warm Yellow 0 0.5
6 Warm Red 1 0.5
7 Hot Yellow 0 3.0
8 Hot Yellow 1 3.0
9 Cold Yellow 1 1000000.0
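
Since the two encodings differ only by the logarithm, exponentiating the WoE values should recover the probability ratios. The sketch below checks that on the same ten rows. The variable names (`good`, `bad`, `pr`, `woe`) are chosen here for illustration, and `Series.mask` is used in place of `np.where` purely as a stylistic assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Share of Target == 1 (Good) and Target == 0 (Bad) per category
good = df.groupby('Temperature')['Target'].mean()
bad = 1 - good
# Same small-value guard as in the article: replace a zero denominator
bad = bad.mask(bad == 0, 0.000001)

pr = good / bad      # Probability Ratio encoding
woe = np.log(pr)     # Weight of Evidence encoding

# exp(WoE) gives back the probability ratio for every category
assert np.allclose(np.exp(woe), pr)
print(pd.DataFrame({'PR': pr, 'WoE': woe}))
```

One practical consequence: the guard value turns all-good categories into a ratio of about one million, which is far larger than the other values, whereas the log in WoE compresses it to about 13.8.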

2.9 Label encoding

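This method assigns each category an integer from 0 to N-1, where N is the total number of categories. Its drawback is that, even when the categories have no order or other relationship, the resulting numbers suggest that one exists. In the example below, the codes appear to imply Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3).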

Using Scikit-learn

from sklearn.preprocessing import LabelEncoder
df['Temp_label_encoded'] = LabelEncoder().fit_transform(df.Temperature)
df
/ Temperature Color Target Temp_label_encoded
0 Hot Red 1 1
1 Cold Yellow 1 0
2 Very Hot Blue 1 2
3 Warm Blue 0 3
4 Hot Red 1 1
5 Warm Yellow 0 3
6 Warm Red 1 3
7 Hot Yellow 0 1
8 Hot Yellow 1 1
9 Cold Yellow 1 0

You can also use pandas' factorize, which numbers the categories in order of first appearance:

df['Temp_factorize_encoded'] = pd.factorize(df['Temperature'])[0]
df
/ Temperature Color Target Temp_factorize_encoded
0 Hot Red 1 0
1 Cold Yellow 1 1
2 Very Hot Blue 1 2
3 Warm Blue 0 3
4 Hot Red 1 0
5 Warm Yellow 0 3
6 Warm Red 1 3
7 Hot Yellow 0 0
8 Hot Yellow 1 0
9 Cold Yellow 1 1
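
The two tables assign different codes to the same categories because LabelEncoder numbers the sorted category values, while factorize numbers them in order of first appearance. The sketch below makes both mappings visible and shows how to decode back to the labels. The variable names are chosen here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

temperature = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                         'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
                        name='Temperature')

# LabelEncoder: classes_ lists the categories in sorted order; position = code
encoder = LabelEncoder()
label_codes = encoder.fit_transform(temperature)
print(list(encoder.classes_))

# factorize: uniques are listed in order of first appearance; position = code
fact_codes, fact_uniques = pd.factorize(temperature)
print(list(fact_uniques))

# With sort=True, factorize produces the same codes as LabelEncoder
sorted_codes, sorted_uniques = pd.factorize(temperature, sort=True)
assert (sorted_codes == label_codes).all()

# Decoding: map the integer codes back to the original labels
decoded = encoder.inverse_transform(label_codes)
assert list(decoded) == list(temperature)
```

Either way, the integers are only identifiers. Neither ordering reflects an actual ranking of the temperatures.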
